We present a parallel algorithm that computes the ask and bid prices of an American option when proportional
transaction costs apply to the trading of the underlying asset. The algorithm computes the prices
on recombining binomial trees, and is designed for modern multi-core processors. Although parallel option 
pricing has been well studied, none of the existing approaches takes transaction costs into consideration. 
The algorithm that we propose partitions a binomial tree into blocks. In any round of computation a block is 
further partitioned into regions which are assigned to distinct processors. To minimise load imbalance the 
assignment of nodes to processors is dynamically adjusted before each new round starts. Synchronisation 
is required both within a round and between two successive rounds. The parallel speedup of the algorithm 
is proportional to the number of processors used. The parallel algorithm was implemented in C/C++ via 
POSIX Threads, and was tested on a machine with 8 processors. In the pricing of an American put option, 
the parallel speedup against an efficient sequential implementation was 5.26 using 8 processors and 1500 
time steps, achieving a parallel efficiency of 65.75%. 

Categories and Subject Descriptors: G.4 [Mathematical Software]: Algorithm design and analysis; Parallel
and vector implementations

General Terms: Algorithms, Design, Experimentation, Performance 

Additional Key Words and Phrases: Parallel algorithm, American option pricing, binomial tree model, trans- 
action costs 

1. INTRODUCTION 

An American call (put) option is a financial derivative contract which gives the option 
holder the right but not the obligation to buy (sell) one unit of a certain asset (stock) for 
the exercise price K at any time until a future expiration date T. Option pricing is the 
problem of computing the price of an option, and is crucial to many financial practices.
Since the classic work on this topic by Black and Scholes [1973] and Merton [1973],
many new developments have been introduced. In this paper, we present a parallel
algorithm and its multi-threaded implementation that computes the ask and bid
prices of an American option when proportional transaction costs apply to trading in
the underlying asset. Previous work on parallel valuation of European and/or American
options can be found in [Gerbessiotis 2004; Gerbessiotis 2010; Ghuloum et al. 2007;
Peng et al. 2010; Solomon et al. 2010; Zubair and Mukkamala 2008], but they all
assumed zero transaction costs in trading the underlying asset, which is often not the
case.

When the underlying transaction costs are considered, the no-arbitrage price 
of an American option is no longer unique, but is confined within an in- 
terval. The upper bound of this interval is the ask price of the option, 
and the lower bound the bid price. For an American option based on a 
single underlying asset, its ask price can be derived from Algorithm 3.1
in [Roux and Zastawniak 2009], and its bid price from Algorithm 3.5. Unlike
the previous approaches [Perrakis and Lefoll 1997; Perrakis and Lefoll 2000;
Perrakis and Lefoll 2004; Boyle and Vorst 1992; Bensaid et al. 1992] in pricing
American/European options under transaction costs, the applicability of Algorithms 3.1
and 3.5 is not confined by the values of certain market and model parameters, or by
methods of settlement (cash or physical delivery of the underlying asset). Besides pricing
vanilla options such as puts and calls, the algorithms can also be applied to the valua- 
tion of options with more complex payoffs, such as American bull spreads. 

The parallel algorithm that we present in this paper computes the ask and bid 
prices on recombining binomial trees, and was implemented in C/C++ via POSIX 
Threads. The implementation was tested on a machine with 8 processors (2 sockets
× quad-core Intel Xeon E5405 at 2.0GHz). Experimental results showed that, for
example, when the number N of time steps was 1500 the parallel speedup in pricing
an American put option was 5.26. Compared to the results obtained in the previous
work [Gerbessiotis 2004; Gerbessiotis 2010; Peng et al. 2010], this multi-threaded
approach reduces the overhead of parallelisation and gains speedups on problems of
much smaller sizes.

The contributions of this work are twofold. First, a parallel algorithm is de- 
signed and implemented which computes the ask and bid prices of American op- 
tions under proportional transaction costs, whereas previous work for the same 
problem did not take transaction costs into consideration. Second, a refined generic 
strategy for partitioning a recombining binomial tree is developed. Like the
previous partition schemes [Gerbessiotis 2004; Gerbessiotis 2010; Peng et al. 2010;
Zubair and Mukkamala 2008], our algorithm divides the whole tree into blocks
consisting of nodes from multiple levels (where each level in the binomial tree consists of
nodes at a particular time step). Each of these blocks is further divided into regions
which are assigned to distinct processors in each single round of the computation.
The previous schemes fixed each processor's assignment from the start of the
computation. However, as the computation proceeds towards the root of the binomial tree
the parallelism that can be exploited decreases, so with a fixed assignment the load
imbalance between different processors becomes more severe as the computation
progresses. In contrast, our partition scheme re-calculates each processor's workload before
the start of each new round so as to minimise the imbalance. The partition scheme is
generic in the sense that its applicability is not confined by the choice of the parameter
values.

The parallel binomial algorithm we developed is not specific to this particular prob- 
lem of pricing American options under transaction costs. In the appendix we show the 
application of this parallel algorithm in pricing American options without transaction 
costs. The source code for both applications of the parallel binomial algorithm
is freely available by email.

Organisation of the rest of the paper. Related work is reviewed in Section 2. The
sequential pricing algorithms are briefly explained in Section 3. The parallel algorithm
and its analysis are presented in Section 4. Experimental results are reported in
Section 5. Conclusions are drawn in Section 6, which also contains a discussion of future
work. The appendix contains a discussion about applying the parallel algorithm to the
pricing of American options with no transaction costs, and presents the results from 
the performance tests on the same machine. 

2. RELATED WORK 

Previous approaches in parallel option pricing are discussed in this section. None of 
this work took transaction costs into consideration. 

To exploit data-parallelism on recombining binomial/trinomial trees, a parallel option
pricing algorithm must partition a whole tree into blocks and assign them
to distinct processors for parallel processing. Some approaches [Gerbessiotis 2004;
Gerbessiotis 2010; Peng et al. 2010; Zubair and Mukkamala 2008] divided a
binomial/trinomial tree into blocks consisting of multiple levels of nodes, and processed the
blocks using multiple processors. Others [Kolb and Pharr 2005; Solomon et al. 2010]
processed nodes of a single level in parallel and afterwards moved to the next-highest
level in sequential order. Compared with the latter method, the former requires more
sophisticated synchronisation strategies and thus is more complicated to implement.
Its advantage is that it causes less parallelisation overhead. The partition scheme
we designed in our algorithm belongs to the first class.

Gerbessiotis [2004] presented an architecture-independent parallel pricing algorithm
for American and European-style options on recombining binomial trees. The
algorithm partitioned a binomial tree into b × b blocks and assigned these blocks to
distinct processors in a wrapped-mapping manner such that the maximum input data
imbalance between any two processors is limited by b. This assignment (Fig. 5 in
[Gerbessiotis 2004]) was determined from the start of the computation according to the
number of leaf nodes at level N and the number p of processors involved. The
computation on the whole binomial tree was divided into rounds, where in each round b levels
of the tree were processed. No load re-balancing was applied after each round of the
computation. The parallelisation was achieved via the Oxford BSP (Bulk Synchronous
Parallel) [Bisseling 2004] Toolset, BSPlib, and another non-BSP message passing
interface (MPI), LAM-MPI [Burns et al. 1994]. The implementation was tested on a
cluster of 16 PC workstations, each running a dual-Pentium 350 MHz. Their tests showed
that when N = 8192 and b = 128, using BSPlib, the parallel speedup was 2.71 when
p = 8 and 3.19 when p = 16 (Table 1 in [Gerbessiotis 2004]). When implemented via
LAM-MPI, the speedups were 2.23 and 2.28, respectively (Table 5 in [Gerbessiotis 2004]).

Peng et al. [2010] presented a parallel option pricing algorithm based on a Backward
Stochastic Differential Equation (BSDE). The computation was performed on binomial
trees that model the Brownian dynamic change of the underlying asset price. The
algorithm assumed the number N of time steps and the number p of processors to be
powers of two. To avoid frequent communications they introduced a parameter L such
that in each iteration of the computation L levels of nodes were processed in parallel.
Their algorithm assumed that L was a power of two plus one and that N was divisible by
L − 1. Each processor's assignment (Fig. 2 in [Peng et al. 2010]) was fixed at the start of
the computation. No load re-balancing was attempted afterwards. The algorithm was
implemented in C via MPI. Tests were made on a cluster of 16 PC nodes where each
node ran 2 Intel Xeon DP 2.87 GHz. The parallel speedup was 3.15 using 8 processors
and 3.33 using 16 (Table 1 in [Peng et al. 2010]), when N = 8192 and L = 9.

A GPU-based (graphics processing unit) solution [Dai et al. 2010] to the BSDE
approach for option pricing was presented by the same group of researchers, where they
adopted the theta method [Zhao et al. 2006] to solve BSDEs. (The theta method
discretises a continuous BSDE on a time-space grid. At each node of the grid Monte
Carlo simulations are used to approximate the mathematical expectations. The whole
process requires a large amount of calculation but suits the computing architecture
of a GPU.) The implementations were tested on a 2.67 GHz Intel Core i7 920 and
an NVIDIA Tesla C1060. When N = 128 the runtime of the sequential code was
about 23000 seconds, and that of the GPU code was about 99 seconds (Table 1 in
[Dai et al. 2010]).



Zubair and Mukkamala [2008] proposed a cache-friendly parallel option pricing
algorithm for shared memory symmetric multi-processors (SMP). The algorithm gave
much consideration to the memory hierarchy available in modern RISC processors.
In order to be cache-efficient the algorithm employed techniques such as cache and
register blocking, and partitioned a binomial tree into triangular and quadrangular
blocks (Fig. 8 in [Zubair and Mukkamala 2008]). As the computation proceeded
towards the root of the tree the number of blocks decreased and so did the number
of processors that could be utilised. The algorithm was implemented in Fortran
95 with parallelisation achieved via OpenMP directives [Garg et al. 2001]. A test of
the parallel algorithm on 8 Sun UltraSPARC III 1050MHz processors showed that
when the block size was 128 and N = 8192 the parallel speedup was 4.96 (Table 4
in [Zubair and Mukkamala 2008]) using all 8 processors. A similar serial
cache-friendly option pricing algorithm was discussed by Savage and Zubair [2010]. It was
based on the binomial and trinomial models without parallelisation of any type.

As a supplement to the latency-tolerant BSP-oriented algorithms for option pricing
on binomial trees in [Gerbessiotis 2004] and trinomial trees in [Gerbessiotis 2003],
Gerbessiotis [2010] presented a more up-to-date parallel algorithm using the explicit
finite difference method [Schwartz 1977], which is equivalent to computing discounted
expectations on a trinomial tree. The algorithm partitioned the nodes of a trinomial
tree into rectangular blocks (Fig. 4 in [Gerbessiotis 2010]) of b levels. As in
[Gerbessiotis 2004], the nodes of a block were further divided into three regions, one
for nodes for which the discounted expectations have already been computed, one for
nodes for which the computation does not depend on the results from nodes in a
neighbouring block, and one for nodes for which such dependency exists. The algorithm was
implemented via the Oxford BSP Toolset, the non-BSP-specific libraries LAM-MPI and
Open MPI [Graham et al. 2005], and SWARM [Bader et al. 2007], a parallel computing
framework for multi-core processors. Their tests were done on the same PC cluster
as in [Gerbessiotis 2004] and on two multi-core processors. On the 2.4GHz Intel
quad-core Q6600 used in their tests, the parallel speedup was 3.63 using BSP and MPI,
and 3.13 using SWARM (Table 11 in [Gerbessiotis 2010]), when N = 8192, b = 129 and
p = 4.

Ghuloum et al. [2007] published a white paper where parallel binomial option pricing
was implemented in Ct¹, a data parallel API implemented within the C++-based
syntactic framework. The parallel code was tested on two 2.33GHz Intel Xeon quad-core
E5345 processors, and gained much speedup over a sequential C++ implementation thanks
to Ct's built-in SSE-based implementation of the common math functions.

Solomon et al. [2010] presented a GPU-based parallel solution for pricing American
lookback options on recombining binomial trees. The algorithm did backward
computation on a binomial tree with nodes at each level being processed in parallel.
Initially, the computation was carried out by the GPU, but after a certain threshold
level was passed the computation was taken over by the CPU, because as the
calculation proceeded to the root of the tree the parallelism that could be exploited
decreased. Their tests were performed on a 3.0GHz Intel Core 2 Duo and a 216-core
NVIDIA GTX 260. The speedup of the CPU+GPU hybrid implementation against an
un-optimised sequential code was about 20 (Fig. 7 in [Solomon et al. 2010]) when the
number of time steps was 5000 and the threshold was set as 256. The same partition
scheme was used by the GPU-based parallel binomial option pricing discussed by
Kolb and Pharr [2005], where the nodes in each single level of a tree were processed
in parallel.



Huang and Thulasiram [2005] presented a parallel algorithm for pricing basket
American-style Asian options on recombining binomial trees. The number of levels
in a tree and the number of processors were assumed to be a power of two. To partition
a tree, initially, all leaf nodes were evenly distributed among the processors. The
computation proceeded to the root of the tree in such a way that in a given processor, for
every pair of adjacent nodes at a certain level i, the processor computed the option price
for the pair's parent node at level i − 1 (Fig. 3 in [Huang and Thulasiram 2005]).
Eventually one processor computed the option price at level 0. No load re-balancing among the
processors was attempted during the course of the computation. The implementation
of the algorithm was in C via MPI.

¹ After Intel's merger with RapidMind, Ct became a part of what is now known as the Intel Array
Building Blocks (ArBB).

Compared with these parallel approaches in binomial option pricing, the generic 
partition scheme in our algorithm makes ample allowance for minimising the load im- 
balance between processors to enhance the efficiency of the parallelisation. The multi- 
threaded implementation of the algorithm is light-weight: the parallel speedup on 8 
processors in a tested American put option is 5.26 when N = 1500. 

Algorithms for parallel option pricing based on models other than the
binomial/trinomial tree can be found as well. These are loosely connected to what we
present in this paper. Fusai et al. [2010] published a numerical procedure for pricing
exotic path-dependent options when the underlying asset price evolves according to
a generic Lévy process [Schoutens 2003]. By geometric randomisation of the option
expiration, the n-step backward recursion in option pricing was transformed into an
integral equation. The option price was then obtained by solving n independent
integral equations. Because the equations were mutually independent they were solved in
parallel on a grid computing architecture.

Surkov [2010] presented algorithms based on the Fourier space time-stepping
method to price single- and multi-asset European and American options with stock
prices following exponential Lévy processes. The algorithms were implemented on an
NVIDIA GeForce 9800 GX2 video card with only one of the two GPUs being used.

Prasain et al. [2010] proposed a parallel synchronous option pricing algorithm to
price simple European options using particle swarm optimisation, a nature-inspired
global search algorithm based on swarm intelligence.

Sak et al. [2007] discussed the application of parallel computing in pricing
backward-starting fixed strike Asian options that are continuously averaged. Through 
a change of numeraire they transformed the pricing problem into solving a one-state- 
variable partial differential equation (PDE) by both explicit and Crank-Nicolson's im- 
plicit finite-difference methods. The algorithms they designed were implemented via 
MPI and were tested on a Linux PC cluster. 



3. THE SEQUENTIAL PRICING ALGORITHMS 

We first briefly go through the idea of pricing American options when transaction costs
are not included. Consider an American put option with strike $K$ and expiry $T$, which
can be exercised once at any time $0, 1, 2, \ldots, N$. We use the one-step binomial process
example in Fig. 1(a), where at time $t$ the price of the underlying stock of an American
put option is $S_t$. After one time step, the price of the stock can be either $uS_t$ or
$dS_t = u^{-1}S_t$. We assume the interest rate over the one-step time period is $\rho$, that is,
1 unit of cash bond at time $t$ will grow to $r = 1 + \rho$ units at time $t + 1$. The
risk-neutral probability for the up-move is $p = (r - d)/(u - d)$, and for the down-move is
$1 - p$. The payoff of the American option at time $t$ is $P_t = \max(K - S_t, 0)$; that is, if
$S_t < K$ then the owner of the option will exercise his/her right to sell one unit of the
stock (worth $S_t$) at the price $K$, thus making a profit of $K - S_t$, and if $S_t \ge K$ then
exercising the option is not advantageous. The option is priced by backward induction,
which gives a unique arbitrage-free price $\pi_t$ for the option at time $t$. At the maturity
date $N$ the value of the option is the same as its payoff, so $\pi_N = P_N$. For $t < N$, the
value $\pi_t$ of the option at the node $S_t$ is the maximum of its discounted expected payoff
$r^{-1}E(\pi_{t+1} \mid S_t) = r^{-1}(p\pi^u_{t+1} + (1 - p)\pi^d_{t+1})$ at time $t$ and its immediate payoff $P_t$ if the
option is exercised at $t$, that is $\pi_t = \max(P_t, E(\pi_{t+1} \mid S_t)/r)$. For example, $p = 0.9454$ for
the parameter values in Fig. 1(a), which also shows the option payoffs for $K = 130$. Now
suppose that the option prices at the nodes at time $t + 1$ have already been computed
and happen to coincide with the corresponding payoffs, $\pi^u_{t+1} = 10$ and $\pi^d_{t+1} = 46.67$ (that
is, in this example both these nodes are in the exercise region for the American option).
Then we can compute $\pi_t = \max(30, 10.17) = 30$. To compute $\pi_0$ on a binomial tree of
multiple levels, we start from the leaf nodes and go all the way back to the root to
obtain the price of the option at time 0.

[Figure 1, panel (a): without transaction costs. Panel (b): with transaction costs, where the
bid and ask stock prices are $(S^b_t, S^a_t) = (80, 120)$ at time $t$, $(96, 144)$ at the up-move node
and $(66.67, 100)$ at the down-move node.]

Fig. 1: One-step binomial processes with and without transaction costs.
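
The one-step calculation above can be reproduced in a few lines of Python. This is only an illustrative sketch, not the authors' code: the values $S_t = 100$, $u = 1.2$ and $r = 1.18$ are not stated in the text and are inferred here from the quoted payoffs and from $p = 0.9454$, and the function name is hypothetical.

```python
def american_put_one_step(s, k, u, r, value_up, value_down):
    """One backward-induction step for an American put without transaction costs."""
    d = 1.0 / u                      # down-move factor, d = 1/u
    p = (r - d) / (u - d)            # risk-neutral probability of the up-move
    continuation = (p * value_up + (1.0 - p) * value_down) / r
    payoff = max(k - s, 0.0)         # immediate exercise payoff P_t
    return p, continuation, max(payoff, continuation)


# Assumed parameters, inferred from the example rather than quoted from it.
S_T, K, U, R = 100.0, 130.0, 1.2, 1.18
pi_up = max(K - U * S_T, 0.0)        # option value at the up-move node
pi_down = max(K - S_T / U, 0.0)      # option value at the down-move node
p, continuation, price = american_put_one_step(S_T, K, U, R, pi_up, pi_down)
print(round(continuation, 2), price)
```

Under these assumed inputs the discounted expectation and the option value agree with the figures 10.17 and 30 quoted above, and $p$ agrees with 0.9454 up to rounding in the last digit.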

Proportional transaction costs in asset (stock) trading are modelled by bid-ask
spreads. That is, at time $t$ a unit of stock can be bought for the ask price $S^a_t$ or sold
at the bid price $S^b_t$. To link this with the friction-free model, we shall assume that
$S^a_t = (1 + k)S_t$ and $S^b_t = (1 - k)S_t$, where $k \in [0, 1)$ is the transaction cost rate.
Under these conditions the arbitrage-free price of an American option at any time $t$ is no
longer unique, but is confined within an interval. The upper limit of this interval is
the ask price $\pi^a_t$ of the option, and the lower limit the bid price $\pi^b_t$. The ask price is the
price at which the option can be bought on demand. It is also the minimum amount 
of wealth that the seller of the option needs to hedge his/her positions in all circum- 
stances, that is, to deliver to the buyer the obligated payoff portfolio without having 
to inject extra wealth. The bid price is the price at which the option can be sold on 
demand. It is also the maximum amount of wealth that the buyer can borrow against 
the right to exercise the option. 

Let $(\xi, \zeta)$ be the payoff process of an American option, that is, if the holder exercises
the option at any time $t = 0, 1, 2, \ldots, N$, then the seller must deliver to the holder a
portfolio consisting of $\xi_t$ in cash and $\zeta_t$ units of the asset (stock). For the above American
put option this is $(K, -1)$ at all times. To hedge his/her position the seller should hold
a portfolio consisting of cash and the underlying stock, and we use $(x_t, y_t)$ to denote
his/her holdings of cash and stock at time $t$. We define the seller's expense function $u_t$
at time $t$ to be

$$u_t(y) = \xi_t + (y - \zeta_t)^- S^a_t - (y - \zeta_t)^+ S^b_t, \tag{1}$$

where $(y - \zeta_t)^- = -\min(y - \zeta_t, 0)$ and $(y - \zeta_t)^+ = \max(y - \zeta_t, 0)$. This is a function of
the seller's stock holding at time $t$. It defines the minimum amount of cash that the
seller needs at $t$ to fulfil his/her obligation if the option is exercised at $t$. So if the seller
wishes to form a self-financing strategy to cover his/her position at $t$, his/her holdings
$(x_t, y_t)$ must belong to the epigraph of $u_t$, that is $(x_t, y_t) \in \operatorname{epi} u_t$. (The epigraph of any
function $f$ is the set of points which lie above $f$ in the $yx$-plane, namely
$\operatorname{epi} f = \{(y, x) \in \mathbb{R}^2 \mid x \ge f(y)\}$.)

Now using the same American put example (with $K = 130$) and the one-time-step
binomial process (Fig. 1(b)), we explain how the option ask price $\pi^a_t$ is computed at
any time $t$. This is done by constructing a sequence of piecewise linear functions $z_t$ by
backward induction from time step $N$, when $z_N = u_N$. The interpretation of $z_t$ is that
a portfolio $(x, y)$ at time $t$ allows the seller to deliver the option without risk if and only
if $(x, y) \in \operatorname{epi} z_t$. For $t < N$, we start from the two nodes at time $t + 1$. Suppose that

$$z^u_{t+1}(y) = u^u_{t+1}(y) = 130 + 144(y + 1)^- - 96(y + 1)^+ \tag{2}$$

at the up-move node. This is a piecewise linear function because

$$z^u_{t+1}(y) = \begin{cases} -144y - 14, & y < -1, \\ -96y + 34, & y \ge -1, \end{cases} \tag{3}$$

see Fig. 2. For the down-move node suppose that

$$z^d_{t+1}(y) = u^d_{t+1}(y) = 130 + 100(y + 1)^- - 66.67(y + 1)^+. \tag{4}$$

At time $t$, because the seller must be prepared for the worst case scenario, we
calculate the maximum of $z^u_{t+1}$ and $z^d_{t+1}$, to obtain $w_t = \max(z^u_{t+1}, z^d_{t+1})$. Now since $x$ units
of cash at time $t$ will grow to $xr$ at time $t + 1$, the function $w_t$ must be discounted by
$r$. The slopes of this discounted function $w_t/r$ must then be restricted within the
interval $[-S^a_t, -S^b_t]$, which is $[-120, -80]$ in this example, to account for the possibility
of rebalancing the portfolio at time $t$. This restricted function is denoted by $v_t$. It is
the discounted expected expense function at time $t$. The epigraph of this function
consists of portfolios covering the option seller if the option is exercised at time $t + 1$ or
later. Now what if the option is exercised at time $t$? The expense function $u_t$ at $t$ is
$u_t(y) = 130 + 120(y + 1)^- - 80(y + 1)^+$. Again, the seller must be prepared for the worst,
which corresponds to the expense function being the maximum

$$z_t(y) = \max(u_t(y), v_t(y)) = u_t(y) = \begin{cases} -120y + 10, & y < -1, \\ -80y + 50, & y \ge -1. \end{cases} \tag{5}$$

These piecewise linear functions are shown in Fig. 2. The option ask price at time $t$ for
this example is then $\pi^a_t = z_t(0) = 50$, because it is the minimum amount of cash that
enables a seller without a stock holding to hedge his/her position without risk at time
$t$. When the above computation is carried out on a binomial tree representing $N$ time
steps, we start from the leaf nodes and work backwards to the root node at time 0. The
option ask price is then $\pi^a_0 = z_0(0)$.

Fig. 2: The piecewise linear functions in computing the ask price.

For the buyer's case, the buyer's expense function at time $t$ is

$$u_t(y) = -\xi_t + (y + \zeta_t)^- S^a_t - (y + \zeta_t)^+ S^b_t, \tag{6}$$

because it is he/she who will receive the portfolio $(\xi_t, \zeta_t)$. The pricing procedure for
$z_{t+1}$, $w_t$ and $v_t$ is similar to that for the seller. But when $z_t$ is computed the minimum
operation is used instead of the maximum. The reason for this difference is that at
any time $t < N$ the buyer needs to choose between exercising or waiting (and choose a
portfolio in $\operatorname{epi} u_t$ or $\operatorname{epi} v_t$), whereas the seller needs to be prepared for any eventuality
(i.e. they need a portfolio in $\operatorname{epi} u_t$ and $\operatorname{epi} v_t$). In this example, if it is assumed that

$$z^u_{t+1}(y) = -130 + 144(y - 1)^- - 96(y - 1)^+,$$

$$z^d_{t+1}(y) = -130 + 100(y - 1)^- - 66.67(y - 1)^+,$$

then

$$z_t(y) = \min(u_t(y), v_t(y)) = u_t(y) = \begin{cases} -120y - 10, & y < 1, \\ -80y - 50, & y \ge 1. \end{cases}$$

The option bid price $\pi^b_t$ at time $t$ is $\pi^b_t = -z_t(0) = 10$, because the bid price is the
maximum amount of wealth that the buyer can borrow against the right of exercise.
See Fig. 3 for a plot of the piecewise linear functions in the buyer's case.

Fig. 3: The piecewise linear functions in computing the bid price.

Full details of the procedures for finding bid and ask prices under proportional
transaction costs can be found in Algorithms 3.1 and 3.5 in [Roux and Zastawniak 2009].
Note that in general $\pi_t \in [\pi^b_t, \pi^a_t]$.
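
The two one-step examples of this section can be checked numerically with the sketch below. It rests on several assumptions that are not part of the original text: the slope restriction is interpreted as a brute-force minimisation over a grid of target stock positions (the cheapest way to rebalance into a position that is covered at time t + 1), whereas the algorithms of Roux and Zastawniak manipulate the piecewise linear functions exactly; the growth factor r = 1.18 is inferred from the friction-free example; and all function and variable names are hypothetical.

```python
def expense(cash, stock, s_ask, s_bid):
    """Expense function y -> cash + (y - stock)^- * s_ask - (y - stock)^+ * s_bid."""
    return lambda y: cash + max(stock - y, 0.0) * s_ask - max(y - stock, 0.0) * s_bid


def restrict_slopes(w, grid, s_ask, s_bid):
    """Cash needed at stock position y to rebalance into some y' covered by w(y').

    Shares are bought at s_ask and sold at s_bid, which confines the slopes of
    the result to the interval [-s_ask, -s_bid].
    """
    def v(y):
        return min(w(yp) + max(yp - y, 0.0) * s_ask - max(y - yp, 0.0) * s_bid
                   for yp in grid)
    return v


R = 1.18                                        # assumed one-step growth factor
GRID = [i / 100.0 for i in range(-300, 301)]    # candidate stock positions y'
NOW, UP, DOWN = (120.0, 80.0), (144.0, 96.0), (100.0, 66.67)   # (ask, bid) prices


def one_step(cash, stock, combine):
    """Build z_t from the two successor nodes, combining u_t and v_t by max or min."""
    z_up = expense(cash, stock, *UP)
    z_down = expense(cash, stock, *DOWN)
    w = lambda y: max(z_up(y), z_down(y)) / R   # worst case at t + 1, discounted
    v = restrict_slopes(w, GRID, *NOW)
    u = expense(cash, stock, *NOW)
    return lambda y: combine(u(y), v(y))


ask = one_step(130.0, -1.0, max)(0.0)           # seller: z_t = max(u_t, v_t)
bid = -one_step(-130.0, 1.0, min)(0.0)          # buyer:  z_t = min(u_t, v_t)
print(ask, bid)
```

With these inputs the sketch returns an ask price of 50 and a bid price of 10, in agreement with the worked examples. A grid search is used only because it is short to write; it is far too slow for a full tree.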

4. THE PARALLEL PRICING ALGORITHM 
4.1. Binomial tree model 

For an American option whose payoff process and physical expiration time are $(\xi, \zeta)$
and $T$, respectively, let $N$ be the number of time intervals that discretise the time
period from 0 to $T$. Also let $\sigma$ be the volatility of the underlying stock, $R$ the continuously
compounded annual interest rate and $k \in [0, 1)$ the transaction cost rate. Under such
conditions the binomial tree that models the dynamics of the stock price will have $N + 1$
levels, corresponding to the time steps $t = 0, 1, 2, \ldots, N$. The up-move factor $u$,
down-move factor $d$ and cash accumulation factor $r$ over one time step are $u = \exp(\sigma\sqrt{T/N})$,
$d = u^{-1} = \exp(-\sigma\sqrt{T/N})$, and $r = \exp(RT/N)$, respectively. The pricing algorithms in
[Roux and Zastawniak 2009] actually add an extra time instant $t = N + 1$ to the model
and set the option payoff as $(0, 0)$ at all the $N + 2$ nodes in that level. The purpose
of adding this extra time step is to model the possibility that under certain
circumstances it may be in the best interests of the option holder to leave it unexercised. In
line with [Perrakis and Lefoll 2004] and [Roux and Zastawniak 2009], we assume that
no transaction costs apply at time 0, that is, $S^a_0 = S^b_0 = S_0$.
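
As a small illustration of these definitions, the sketch below computes the one-step factors and the bid and ask stock prices at a node. The function names, the node indexing by number of up-moves, and the numerical inputs are all hypothetical choices made for this example; they are not taken from the authors' implementation.

```python
import math


def tree_parameters(sigma, rate, expiry, n_steps):
    """Return the up factor u, down factor d and cash accumulation factor r."""
    dt = expiry / n_steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    r = math.exp(rate * dt)
    return u, d, r


def bid_ask_stock_price(s0, u, k, t, ups):
    """Bid and ask stock prices at the node reached by `ups` up-moves in t steps."""
    mid = s0 * u ** (2 * ups - t)    # d = 1/u, so u^ups * d^(t - ups) = u^(2*ups - t)
    if t == 0:
        return mid, mid              # no transaction costs at time 0
    return (1.0 - k) * mid, (1.0 + k) * mid


# Hypothetical market data: 20% volatility, 10% interest, 3 months, N = 1500.
u, d, r = tree_parameters(sigma=0.2, rate=0.1, expiry=0.25, n_steps=1500)
bid, ask = bid_ask_stock_price(s0=100.0, u=u, k=0.005, t=3, ups=2)
print(u, d, r, bid, ask)
```

For instance, with $S_0 = 100$, $u = 1.2$ and $k = 0.2$, the node after one up-move has bid and ask stock prices of 96 and 144, as in the one-step example of Section 3.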

4.2. The partition scheme and the synchronisation mechanism 

Assume we have $p$ distinct processors in a parallel computer. Because the computation
of the $u, w, v, z$ functions at different nodes can be performed independently in parallel,
we can partition a whole binomial tree into blocks of nodes and assign these blocks to
distinct processors. The parallel algorithm, like its sequential counterpart, starts off
at the leaf nodes where $t = N + 1$ and works backwards towards the root of the tree.
The whole process is accomplished by $p$ threads, denoted by $p_0, p_1, \ldots, p_{p-1}$, with each
thread being bound explicitly to a distinct processor. The whole computation is divided
into rounds, where in each round the nodes of a block are processed by the $p$ threads
in parallel.

In general, if the base level $B$ (whose nodes have been processed in the $(i - 1)$th
round) of an $i$th round is at time $t = n$, $n \in [1, N + 1]$, then the total number of nodes
at that level will be $n + 1$. These $n + 1$ nodes will be divided equally among the $p$ threads.
So all the threads $p_i$, $i = 0, 1, \ldots, p - 2$, get $\lfloor (n + 1)/p \rfloor$ nodes, but the last thread $p_{p-1}$ gets
the remaining $(n + 1) - (p - 1)\lfloor (n + 1)/p \rfloor$ nodes. We use $L$ to denote the maximum number of levels
that are processed towards the root in a round, that is, the maximum number of levels
in a block. However, the number $D$ of levels that are actually processed in a round is
jointly determined by $L$ and the number of nodes that each thread gets, because this
number $D$ cannot exceed $\lfloor (n + 1)/p \rfloor - 1$. So we have $D = \min(L, \lfloor (n + 1)/p \rfloor - 1)$. So
in a round whose base level $B$ contains $n + 1$ nodes all the threads will be assigned a
block of $\lfloor (n + 1)/p \rfloor \times D$ nodes, except the last thread $p_{p-1}$ which only gets a smaller
number of nodes. For a thread $p_i$, $i \in [0, p - 2]$, we further divide its $\lfloor (n + 1)/p \rfloor \times D$
nodes into region A and region B such that the computations performed at the nodes
in region A do not depend on the results from another thread in the same round, but
the computations at the nodes in region B do need results from thread $p_{i+1}$. Note that
the last thread $p_{p-1}$ does not have any B nodes in any round of the computation.
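
The partition of a single round can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names are hypothetical, and the column convention (the node in column j depends on the nodes in columns j and j + 1 one level closer to the leaves) is an assumption chosen to agree with the dependency on thread $p_{i+1}$.

```python
def partition_round(n_nodes, p, max_levels):
    """Split the n_nodes base-level nodes among p threads and choose the depth D."""
    width = n_nodes // p                          # floor((n + 1) / p)
    counts = [width] * (p - 1) + [n_nodes - (p - 1) * width]
    depth = min(max_levels, width - 1)            # D = min(L, floor((n + 1) / p) - 1)
    return counts, depth


def regions(first_col, width, depth):
    """Columns of regions A and B at each of the `depth` levels above the base,
    for a thread (other than the last) that owns `width` base-level columns."""
    last_col = first_col + width - 1
    out = []
    for k in range(1, depth + 1):                 # k levels above the base level
        split = last_col - (k - 1)                # last column needing no neighbour result
        region_a = list(range(first_col, split + 1))
        region_b = list(range(split + 1, last_col + 1))
        out.append((region_a, region_b))
    return out


# Hypothetical round: 12 base nodes, 3 threads, at most L = 3 levels per round.
counts, depth = partition_round(12, 3, 3)
print(counts, depth)
print(regions(0, counts[0], depth))
```

In this hypothetical round each thread owns four base nodes and D = 3. One level above the base, region B is empty because every input is a base node from the previous round; it then grows by one column per level, which is why D is capped at one less than the number of nodes per thread.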

Fig. 4 shows such a division among 3 threads in a round consisting of 3 ($L = D = 3$)
levels of nodes. The nodes enclosed by the dashed frame box at time level $t + 3$ are
the base nodes. For thread $p_0$, to compute the $u, w, v, z$ functions at the nodes in its
region B, it needs results computed by thread $p_1$ at the two nodes in column 4 enclosed
by the thin frame box. Thread $p_0$ cannot start computing at the nodes in its region
B until thread $p_1$ finishes at the node (level $t + 1$, column 4) enclosed by the bold
frame box. In general, thread $p_i$, $i \in [0, p - 2]$, cannot start at nodes in its region B
until thread $p_{i+1}$ finishes at the leftmost node at level $B - D + 1$ in its region A. This
scheme of partitioning into A and B regions was also adopted in [Gerbessiotis 2004]
and [Peng et al. 2010].

The parallel algorithm re-balances the workload of each thread after each round
of the computation. If the current base level is $B$, the next base level will be $B - D$,
containing $B - D + 1$ nodes, and according to this number the workload of each thread in
the next round will be calculated. The parallel algorithm ensures that each thread will
get at least two nodes to process in all the rounds, which means that the minimum
possible value that $D$ can take is 1. If at some level of the tree the number of nodes
is less than $2p$, the number of processors used will be decreased one at a time until this
no-less-than-two-node condition is satisfied. A partition based on the above explanation
is shown in Fig. [5] for N = \Q, p = i and L = i. The figure shows the adjustment of 
the workload after each round and the reduction in the number of processors needed 
as the computation proceeds towards the root of the tree. 
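
The re-balancing step can be sketched in Python as follows. This is a simplified reading rather than the authors' implementation: the hypothetical parameter `cols` is the number of nodes at the base level and stands in for the quantity n + 1 that appears in the paper's formulas, and the function name `plan_round` is an assumption.

```python
def plan_round(cols, p, L):
    """Return (active threads, D, column ranges) for one round.

    cols : number of nodes at the base level of the round (at least 2)
    p    : number of threads in use before the round
    L    : maximum number of levels processed in one round
    """
    # Drop one thread at a time until every thread gets at least two nodes.
    while cols < 2 * p and p > 1:
        p -= 1
    chunk = cols // p                      # base-level nodes per thread
    D = min(L, chunk - 1)                  # levels processed in this round
    # The last thread also takes the columns left over by the division.
    ranges = [(i * chunk, (i + 1) * chunk if i < p - 1 else cols)
              for i in range(p)]
    return p, D, ranges
```

For example, with 5 base nodes and 4 threads the loop reduces the number of active threads to 2, and the round then covers a single level.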

To save the intermediate z functions generated during the computation, instead of
generating the whole tree, the parallel algorithm maintains two buffers, each with
$(L + 1)$ rows $\times$ $(N + 2)$ columns. One of these two buffers is for computing the ask
price, and the other the bid price. The mapping between the whole binomial tree and
the buffers is done in a modular wrap-around manner to avoid the cost of extra
synchronisation and copying back. We use the variable $U$ to denote the base level in the
two buffers in a round of the computation, corresponding to the base level of the tree.
Initially, $U$ is set to 0, and after a round whose base level of the tree is $B$ and which
works $D$ levels towards the root, $U$ is updated by $U \leftarrow (U + D) \bmod (L + 1)$, as in Algorithm 1.
Now suppose the computation is working on the $i$th, $i \in [0, D]$, level down from level $B$.
The piecewise linear functions will be computed and stored in the two buffers at level
$(U + i) \bmod (L + 1)$ according to the piecewise linear functions stored at level
$(U + i - 1) \bmod (L + 1)$ in the buffers.
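
The wrap-around mapping can be sketched in Python as below. The sketch follows the update $U \leftarrow (U + D) \bmod (L + 1)$ used in Algorithm 1; the function names are hypothetical and only the row arithmetic is shown, not the buffers themselves.

```python
def buffer_row(U, i, L):
    """Buffer row that holds the i-th level down from the base level."""
    return (U + i) % (L + 1)


def next_base_row(U, D, L):
    """Buffer row of the new base level after a round of D levels."""
    return (U + D) % (L + 1)
```

Because $D \le L$, the rows written in a round, `buffer_row(U, 1, L)` to `buffer_row(U, D, L)`, never coincide with the base row `U`, so the base level is not overwritten while it is still being read.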
[Flow chart: compute functions u, w, v, z at a node in region A; $G_i \leftarrow 1$, notifies thread $p_{i-1}$; done for region A; $i + 1 < p$; compute functions u, w, v, z at the nodes in region B; broadcasts, all finished; update parameters.]

Fig. 6: The synchronisations on thread $p_i$, $i \in [0, p-1]$. The condition in the first
rhombus box is shown at line 15 in Algorithm 1.


As the whole computation is divided into rounds, the threads have to be synchronised
both within a round and between two successive rounds. Within a round, all the
threads work $D$ levels down the tree in such a way that any two adjacent threads have
to be synchronised. As soon as thread $p_i$, $i \in [1, p-1]$, has finished the leftmost node
(such as the single node enclosed by the bold frame in Fig. 5) at level $B - D + 1$ ($B$ being
the base level of the round) in its region A, it sends a signal $G_i$ to thread $p_{i-1}$, so
that after thread $p_{i-1}$ has finished the nodes in its region A, upon receiving the signal
it can proceed to the nodes in its region B. Once thread $p_i$ has finished
processing all its nodes in regions A and B, it has to wait for the other peer threads
to finish their work. Only after all the threads have finished can the parameters be
updated for the next round. The flow chart in Fig. 6, using thread $p_i$, $i \in [0, p-1]$, as an
example, shows the synchronisation scheme.
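
The synchronisation within one round can be sketched in Python with `threading.Event` for the signals $G_i$ and `threading.Barrier` for the end-of-round wait. This is an illustration, not the authors' POSIX Threads code. The node computation is replaced by a toy recurrence (the sum of two neighbouring values), levels are counted as k = 1, ..., D away from the base level, and all names are hypothetical.

```python
import threading


def sequential(base, D):
    """Reference result: apply the toy recurrence D times."""
    row = list(base)
    for _ in range(D):
        row = [row[c] + row[c + 1] for c in range(len(row) - 1)]
    return row


def run_round(base, p, D):
    """Compute D levels from `base` with p threads; return the last level."""
    n = len(base)
    chunk = n // p
    assert 1 <= D <= chunk - 1, "each thread needs at least D + 1 base nodes"
    levels = [list(base)] + [[None] * (n - k) for k in range(1, D + 1)]
    G = [threading.Event() for _ in range(p)]    # G[i] is raised by thread i
    barrier = threading.Barrier(p)               # end-of-round rendezvous
    k_signal = D - 1 if D > 1 else D             # level at which G[i] is raised

    def compute(k, c):
        # Toy stand-in for computing u, w, v and z at node (k, c).
        levels[k][c] = levels[k - 1][c] + levels[k - 1][c + 1]

    def worker(i):
        s = i * chunk
        e = (i + 1) * chunk if i < p - 1 else n
        for k in range(1, D + 1):                # region A
            for c in range(s, min(e - k, n - k)):
                compute(k, c)
                if i > 0 and k == k_signal and c == s:
                    G[i].set()                   # notify thread i - 1
        if i + 1 < p:                            # region B
            G[i + 1].wait()                      # block until the signal arrives
            G[i + 1].clear()                     # prepare for the next round
            for k in range(1, D + 1):
                for c in range(e - k, e):
                    compute(k, c)
        barrier.wait()                           # wait for all peer threads

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return levels[D]
```

Calling `run_round(list(range(12)), 3, 3)` gives the same values as `sequential(list(range(12)), 3)`, which is a simple way to check that the signalling order respects the data dependencies.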

The pseudo code in Algorithm 1 shows the computational steps performed by thread
$p_i$, $i \in [0, p-1]$, including the synchronisation scheme. Note that node $(l, c)$ there
denotes the node at level $l$ of the tree whose column index is $c$. The nested for-loop that
computes the functions at nodes in region B is similar to the one in region A, so the
details are omitted. The pseudo code is executed by all the threads $p_i$, $i = 0, 1, 2, \ldots, p-1$.
Because thread $p_0$ is the one that computes $z_0$ at the root node at $t = 0$, the option ask
and bid prices are returned by thread $p_0$. We finally have $\pi^a = z_0(0)$ for the ask price,
where $z_0$ is the seller's expense function at $t = 0$, and $\pi^b = -z_0(0)$ for the bid price,
where $z_0$ is the buyer's.
Algorithm 1: Computational steps executed by thread p_i, i ∈ [0, p − 1].

Input: Up-move factor u, interest rate r, number p of processors, number N of time steps, stock price
S_0 at root node, transaction cost rate k.
Output: The functions u, w, v and z for the seller and the buyer at each node.

 1 begin
       // Initialisation at nodes in level t = N + 1.
 2     n ← N + 1, s ← i × ⌊(n + 1)/p⌋;
 3     e ← (i + 1) × ⌊(n + 1)/p⌋ ∀i ≠ p − 1, or e ← n + 1 when i = p − 1;
 4     for l ← s; l < e; l ← l + 1 do
 5         Initialise the z functions with payoff (0,0) for both the parties at node (N + 1, l);
       // Start to work backwards down to the root.
 6     U ← 0; // Used for the mapping from the tree to the buffers.
 7     for B ← N + 1; B > 0 and i < p; do
 8         D ← min(L, ⌊(n + 1)/p⌋ − 1);
 9         o ← 1; // o is the column offset for region A.
10         if D > 1 then T ← B − D + 1; else T ← B − D; // T is the level on which the signal G_i is triggered.
           // Compute functions u, w, v and z at the nodes in region A.
11         for C ← B − 1; C ≥ B − D; C ← C − 1 do
12             m ← min(e − o, C + 1);
13             for l ← s; l < m; l ← l + 1 do
14                 Compute the z functions for both the parties at node (C, l);
15                 if i > 0 and C = T and l = s then
16                     G_i ← 1, and signal the change to thread p_{i−1};
17             o ← o + 1;
           // Compute functions u, w, v and z at the nodes in region B.
18         if i + 1 < p then
19             Block until signal G_{i+1} becomes 1;
20             G_{i+1} ← 0; // Prepare for the next round.
21             Compute the z functions at nodes in region B from level B − 1 to B − D whose column
               indexes are within [s, e − 1];
22         Wait until all threads reach this point;
           // Start to update the parameters for the next round.
23         B ← B − D, U ← (U + D) mod (L + 1);
24         if B > 0 then
25             n ← B + 1; // n is the number of nodes at the next base level.
26             while n < (2 × p) do
27                 p ← max(p − 1, 1);
28             s ← i × ⌊(n + 1)/p⌋;
29             e ← (i + 1) × ⌊(n + 1)/p⌋ ∀i ≠ p − 1, or e ← n + 1 when i = p − 1;
30 end



4.3. Computational time analysis 

Algorithms 3.1 and 3.5 in [Roux and Zastawniak 2009] have polynomial runtime $T_s = O(N^k)$
for some $k > 2$. Although the number of nodes in a recombining binomial tree is
quadratic in $N$ (so a traditional binomial pricing algorithm without transaction costs
has runtime $T_s = O(N^2)$), the maximum, minimum and slope restriction operations
may require slightly more time to finish as the computation proceeds towards the root
because the piecewise linear functions u, w, v and z may acquire more linear pieces
at nodes closer to the root. To see the runtime $T_p$ of the parallel algorithm (Algorithm 1)
and the parallel speedup $S = T_s/T_p$ we start by estimating the number of nodes
processed by thread $p_0$ on the whole binomial tree.
Fig. 7: An estimation on the number of nodes processed by thread $p_0$.



Generally, in a round whose base level is $B$ and has $n$ nodes, all the $p$ threads
work in parallel on $D$ levels of the tree, from level $B - 1$ to $B - D$. According to the
algorithm, the nodes within these $D$ levels will be divided into $p$ blocks, and the number
of nodes assigned to thread $p_0$ is $nD/p$. The total number of nodes within these $D$ levels,
assuming $n$ is an integral multiple of $p$, is $nD - D(D+1)/2$. (See Fig. 7 for an example.) So
the fraction done by thread $p_0$ is $nD/p$ divided by $nD - D(D+1)/2$, which is $\frac{2}{p\,(2 - (D+1)/n)}$.

For large $n$ and relatively small $D$, we can assume that $(D + 1)/n \approx 0$, and, therefore,
the fraction processed by thread $p_0$ is approximately $1/p$. This roughly applies to the
part of the tree from the leaf level ($t = N + 1$) to the level where $t = 2p - 2$ ($p > 1$),
because beyond this level further down the tree the number of processors needed will
decrease. The total number of nodes in the tree from level $t = N + 1$ ($N + 2$ nodes)
to level $t = 2p - 2$ ($2p - 1$ nodes) is $(N + 2p + 1)(N - 2p + 4)/2$, of which the number
processed by thread $p_0$ is $(N + 2p + 1)(N - 2p + 4)/2p$. For the levels beyond $t = 2p - 2$,
because thread $p_0$ will always have 2 nodes to process except at level $t = 0$, the total
number processed by $p_0$ will be $4p - 5$. Therefore, for the whole binomial tree from $t = 0$
to $t = N + 1$, the total number of nodes processed by $p_0$ is $\frac{(N + 2p + 1)(N - 2p + 4)}{2p} + 4p - 5$. If
we assume $N \gg 2p$, then $\frac{(N + 2p + 1)(N - 2p + 4)}{2p} + 4p - 5 \approx N^2/2p$.
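
The size of the approximation can be checked numerically. The Python sketch below evaluates the closed-form count derived above and its leading term $N^2/2p$; the function names are hypothetical.

```python
def nodes_for_p0(N, p):
    """Estimated number of nodes processed by thread p_0 (levels 0 .. N+1)."""
    upper = (N + 2 * p + 1) * (N - 2 * p + 4) / (2 * p)   # levels N+1 down to 2p-2
    lower = 4 * p - 5                                      # levels 2p-3 down to 0
    return upper + lower


def leading_term(N, p):
    """The approximation N^2 / 2p used in the analysis."""
    return N * N / (2 * p)
```

For $N = 1200$ and $p = 2$ the closed form gives 361,503 against a leading term of 360,000, a difference of about 0.4%.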

To verify the validity of this estimation we have compared this estimated number
with the actual counts obtained from several executions of the parallel algorithm. The
data are summarised in Table I. The error rates are calculated and reported as well,
from which it can be seen that the estimation is very close to the actual count in all
the cases. For a fixed $p$ and $L$ ($D = \min(L, \lfloor (n+1)/p \rfloor - 1)$ is jointly determined by $L$, $p$
and $n$), as the number $N$ increases the error rate decreases. This is also in line with
our analysis.
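
The error rates in Table I can be reproduced from the reported counts. The Python sketch below assumes the error is defined as (estimate - actual) / actual, which matches the tabulated percentages; the function name is hypothetical.

```python
def error_rate(actual, N, p):
    """Relative error in percent of floor(N^2 / 2p) against an actual count."""
    estimate = (N * N) // (2 * p)          # fraction part omitted, as in Table I
    return 100.0 * (estimate - actual) / actual
```

For instance, `error_rate(362999, 1200, 2)` rounds to -0.83 and `error_rate(141008, 1500, 8)` rounds to -0.27.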

Now since thread $p_0$ processes about $N^2/2p$ nodes, and the total number of nodes in
a recombining binomial tree (from $t = 0$ to $t = N + 1$) is $(N + 3)(N + 2)/2 \approx N^2/2$,
Table I: A comparison between $N^2/2p$ and the actual number of nodes processed by thread $p_0$ when $L = 5$.
The fraction part of $N^2/2p$ is omitted.

     |          N = 1200          |          N = 1350          |          N = 1500
  p  | Actual   N^2/2p   Error    | Actual   N^2/2p   Error    | Actual   N^2/2p   Error
  2  | 362,999  360,000  -0.83%   | 458,999  455,625  -0.74%   | 566,249  562,500  -0.66%
  4  | 181,198  180,000  -0.66%   | 229,161  227,812  -0.59%   | 282,748  281,250  -0.53%
  8  |  90,311   90,000  -0.34%   | 114,255  113,906  -0.31%   | 141,008  140,625  -0.27%



[Diagram: eight cores (Core 0 to Core 7), four 6MB L2 caches, and two FSBs at 10.66 GB/s each, connected to a memory controller hub that is attached to the memory modules.]

Fig. 8: The parallel machine used in the tests.



so the time required by $p_0$ is roughly $1/p$ of the sequential runtime. The sequential
runtime is $T_s = O(N^k)$ for some $k > 2$, and so the parallel runtime is
$T_p = T_s/p = O(N^k)/p = O(N^k/p)$. The parallel speedup is therefore $S = T_s/T_p = O(p)$,
proportional to the number $p$ of processors used. So we can conclude from this analysis
that the proposed parallel algorithm is cost-optimal in that $pT_p = O(N^k)$ has the
same asymptotic growth rate as the sequential algorithms.



5. EXPERIMENTAL RESULTS 

The parallel pricing algorithm was implemented in C/C++ via POSIX Threads, and was
tested on a dual-socket machine with quad-core Intel Xeon E5405 2.0GHz CPUs, giving 8
processors in total (Fig. 8). The source code was compiled by the Intel C/C++ compiler icpc
12.0 for Linux. The testing machine was running Ubuntu Linux 10.10 64-bit version.
The POSIX thread library used was NPTL (native POSIX thread library) 2.12.1.

To verify the correctness of the parallel algorithm we computed the ask and bid
prices for the same American put option and the American bull spread described in
Examples 5.1 and 5.2, respectively, in [Roux and Zastawniak 2009]. In the American
put example, the parameter values were $T = 0.25$, $\sigma = 0.2$, $R = 0.1$, $S_0 = 100$, $K = 100$,
$N$ varied from 20 to 1000 and $k$ from 0 to 0.02. The American bull spread consists of a
long call with $K = 95$ and a short call with $K = 105$, and is assumed to be settled
in cash, with payoff process $(S_t - 95)^+ - (S_t - 105)^+$. In all the cases the parallel
implementation produced exactly the same figures as reported in Table 1 and Table
2 in [Roux and Zastawniak 2009].

To see the effect that proportional transaction costs have on option prices, we computed
the prices for the same American put option (with $K = 100$) but with $S_0$ varying
from 90 to 110 under three rates $k_0 = 0$, $k_1 = 0.25\%$ and $k_2 = 0.5\%$. The curves of the
option prices $\pi^a$ and $\pi^b$ are plotted in Fig. 9, where it can be seen that for any fixed $S_0$
we have $\pi^b_{k_2} < \pi^b_{k_1} < \pi^b_{k_0} = \pi^a_{k_0} < \pi^a_{k_1} < \pi^a_{k_2}$. Note that the larger the transaction cost
rate $k$ the greater the ask-bid spread of the option price.
[Plot; horizontal axis: current stock price $S_0$.]

Fig. 9: Ask and bid price curves under different transaction cost rates.

To test the performance of the parallel algorithm against an optimised implementation
of the sequential algorithms we performed two additional sets of tests where $k$
was fixed to 0.005 for the American put option and to 0.01 for the bull spread, $N$ varied
from 450 to 1500, and $p$ from 2 to 8. The runtimes and speedups are reported in Table II.
All the times were wall-clock times measured in milliseconds (ms).

Moreover, the serial and parallel runtimes when $p = 8$ and the parallel speedups
when $N = 1500$ are plotted in Fig. 10(a) and Fig. 10(b) respectively. The speedup
curves are very close to straight lines and this supports our analysis that the parallel
speedup $S$ is proportional to $p$. Tests for other values of $L$ were performed in which
very close results were found.

From the speedup ratios we calculated the parallel efficiency $E = S/p$. The analysis
indicates that $S = O(p)$, so $E = S/p = O(p)/p = O(1)$, which means that the efficiency
of this parallel algorithm should stay the same no matter how many processors are
used. However, in practice, because the synchronisation cost grows as the number $p$
increases, we can expect that the efficiency will decay as more processors are used.
The efficiency data are plotted as dashed curves in Fig. 10(b), where it can be seen
that the efficiency diminishes only slightly as $p$ increases.
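
The speedup and efficiency figures follow directly from the measured runtimes. The short Python sketch below shows the arithmetic with values taken from Table II and from the abstract; the function names are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """Parallel speedup S = T_s / T_p from two wall-clock times."""
    return t_serial / t_parallel


def efficiency(s, p):
    """Parallel efficiency E = S / p."""
    return s / p
```

For the bull spread with $N = 1050$ and $p = 8$, `speedup(989.7, 215.1)` is about 4.60, and for the put with $S = 5.26$ on 8 processors `efficiency(5.26, 8)` gives 0.6575, that is 65.75%.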

6. CONCLUSION 

We have presented a parallel algorithm (based on the sequential pricing algorithms
proposed in [Roux and Zastawniak 2009]) that computes the ask and bid prices of
American options under proportional transaction costs, and a multi-threaded implementation
of the algorithm. Using $p$ processors, the algorithm partitions a recombining
binomial tree into multi-level blocks. The whole computation, starting from
the leaf nodes and working backwards to the root of the tree, is divided into rounds,
where in each of these rounds a block of nodes is further partitioned and processed
by multiple processors. Before the start of the next round the workload of each processor
(thread) is adjusted according to the number of nodes at the next base level.
The applicability of the partition method and the associated synchronisation scheme
is not restricted by the values of the parameters $N$ (number of levels of the tree), $L$
(maximum number of levels processed in a round) or $p$ (number of processors). The
parallel algorithm has theoretical speedup $S = O(p)$ and is cost-optimal because
$pT_p = O(p) \times O(N^k/p) = O(N^k)$ for some $k > 2$, which has the same asymptotic
growth rate as the serial runtime $T_s$. The parallel efficiency $E$ of the algorithm is
$E = S/p = O(1)$.

The implementation was tested for its correctness and performance. The results 
demonstrated reasonable speedups, e.g., 5.26 when p = 8 and N = 1500, against an 
Table II: Runtimes and speedups from the parallel performance tests.

        | N = 450 | N = 600 | N = 750 | N = 900 | N = 1050 | N = 1200 | N = 1350 | N = 1500
American put: k = 0.5%, K = 100, S_0 = 100, T = 0.25, R = 0.1, sigma = 0.2, L = 5
Serial  | 181.0   | 325.1   | 498.6   | ...     | ...      | ...      | ...      | ...
p = 2   | 128.7   | ...     | 348.7   | ...     | ...      | ...      | ...      | ...
S       | ...     | 1.42    | 1.43    | ...     | ...      | ...      | ...      | ...
p = 3   | 90.4    | ...     | ...     | ...     | ...      | ...      | ...      | ...
S       | 2.00    | ...     | ...     | ...     | ...      | ...      | ...      | ...
p = 4   | 68.6    | 121.6   | 184.0   | 268.7   | 355.1    | 469.7    | 581.2    | 724.8
S       | 2.64    | 2.67    | 2.71    | 2.66    | 2.76     | 2.77     | 2.77     | 2.74
p = 5   | 57.2    | 96.8    | 151.7   | 213.0   | 286.6    | 374.7    | 466.9    | 583.7
S       | 3.17    | 3.36    | 3.29    | ...     | 3.42     | 3.47     | ...      | ...
p = 6   | 50.0    | 83.7    | 132.2   | ...     | 245.9    | 313.9    | ...      | ...
S       | 3.62    | 3.88    | 3.77    | ...     | 3.98     | 4.15     | ...      | ...
p = 7   | 43.5    | 74.4    | 115.3   | ...     | 214.7    | 281.5    | ...      | ...
S       | 4.16    | 4.37    | 4.33    | ...     | 4.56     | 4.63     | ...      | ...
p = 8   | 40.4    | 67.3    | 102.8   | ...     | 189.4    | 248.6    | ...      | ...
S       | 4.48    | 4.83    | 4.85    | ...     | 5.17     | 5.24     | ...      | 5.26
American bull spread: k = 1%, payoff (S_t - 95)^+ - (S_t - 105)^+, T = 0.25, R = 0.1, sigma = 0.2, L = 5
Serial  | 185.5   | 327.9   | 510.7   | ...     | 989.7    | 1291.4   | ...      | ...
p = 2   | 133.1   | 233.9   | 365.2   | ...     | 699.5    | 906.7    | ...      | ...
S       | 1.39    | 1.40    | 1.40    | ...     | 1.41     | 1.42     | ...      | 1.41
p = 3   | 95.9    | 164.6   | 254.7   | ...     | 486.2    | 624.5    | ...      | ...
S       | 1.93    | 1.99    | 2.01    | ...     | 2.04     | 2.07     | ...      | ...
p = 4   | 76.3    | 130.9   | 203.1   | 279.7   | 369.8    | 474.6    | 596.3    | 734.3
S       | 2.43    | 2.50    | 2.51    | 2.61    | 2.68     | 2.72     | 2.73     | 2.73
p = 5   | 64.1    | 106.6   | 166.5   | 229.3   | 305.0    | 393.0    | 498.6    | 622.5
S       | 2.89    | 3.07    | 3.07    | 3.19    | 3.24     | 3.29     | 3.26     | 3.22
p = 6   | 56.2    | 91.4    | 142.6   | 197.7   | 261.7    | 334.6    | 419.4    | 510.0
S       | 3.30    | 3.59    | 3.58    | 3.70    | 3.78     | 3.86     | 3.88     | 3.93
p = 7   | 48.2    | 80.9    | 124.9   | 171.7   | 228.9    | 291.9    | 364.5    | 444.4
S       | 3.85    | 4.05    | 4.09    | 4.26    | 4.32     | 4.42     | 4.46     | 4.51
p = 8   | 47.3    | 79.6    | 121.5   | 167.6   | 215.1    | 273.6    | 337.7    | 403.6
S       | 3.92    | 4.12    | 4.20    | 4.36    | 4.60     | 4.72     | 4.81     | 4.97

optimised sequential program even for problems of small sizes. The performance of 
the implementation was in-line with the asymptotic analysis. It showed that, because 
no inter-computer communication was involved, the overhead of the parallelisation in 
the multi-threaded implementation was much reduced compared to some previous ap- 
proaches based on message-passing interfaces. The parallel efficiency in the tests is 
seen to decay slightly as p increases. 

For options whose lifetime is short (within months) a relatively small number (usually
several thousand) of time steps may be sufficient to model the price changes of the
underlying asset. To handle such cases the multi-threaded implementation on mainstream
multi-core processors will normally be fast enough. But for pricing long-life
options (expiring in years) where large numbers of time steps are needed the parallel
algorithm may have to be adapted to more powerful platforms, such as many-core
general purpose graphics units. We are also aiming at developing high-performance
parallel algorithms for pricing multi-dimensional options under proportional transaction
costs. Since for such cases a direct implementation of the maximum, minimum
and gradient restriction operations on multi-dimensional structures could be difficult,
we may have to resort to Monte Carlo simulations, which are easily parallelised, and
run them on large-scale parallel architectures.



APPENDIX 

The parallel binomial algorithm we have developed is not specific to the problem of
pricing American options under proportional transaction costs. It can be easily adapted
to other problems, such as the case of pricing American options without considering
transaction costs. In such cases, for an $N$-step simulation the algorithm does not add
an extra time step $t = N + 1$ to the binomial tree. The other difference is that without
transaction costs, all the payoffs and the expectations become scalars, and so the maximum
operations are performed on numbers rather than on piecewise linear functions.
The runtime $T_s$ of a sequential binomial American option pricing algorithm with no
transaction costs is $T_s = O(N^2)$. So the parallel runtime is $T_p = O(N^2/p)$, the parallel
speedup is $S = O(p)$, and the parallel efficiency is $E = O(1)$.
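
The scalar case can be sketched as a sequential Python pricer for an American put. This is not the authors' program: the Cox-Ross-Rubinstein choice of the up-move factor and continuous compounding are assumptions, since this section does not state the tree parameters, and the function name is hypothetical.

```python
import math


def american_put(S0, K, T, sigma, r, N):
    """Price an American put on an N-step recombining binomial tree."""
    dt = T / N
    u = math.exp(sigma * math.sqrt(dt))       # assumed CRR up-move factor
    d = 1.0 / u
    disc = math.exp(-r * dt)                  # one-step discount factor
    q = (math.exp(r * dt) - d) / (u - d)      # risk-neutral up probability
    # Payoffs at the leaf level t = N; column c counts the up-moves.
    values = [max(K - S0 * u**c * d**(N - c), 0.0) for c in range(N + 1)]
    for t in range(N - 1, -1, -1):            # work backwards to the root
        for c in range(t + 1):
            hold = disc * (q * values[c + 1] + (1.0 - q) * values[c])
            exercise = K - S0 * u**c * d**(t - c)
            values[c] = max(exercise, hold)   # scalar maximum at the node
    return values[0]
```

The inner loop over `c` is the part that the parallel algorithm distributes over threads; each node needs only columns `c` and `c + 1` of the level processed just before it.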
Table III: Runtimes and speedups from the parallel performance tests - without transaction costs.
Without considering dividends and transaction costs the price of an American call
option is the same as that of a European call option under the same conditions [Hull 2009],
so we consider only an American put option. We tested on the 8-processor machine
(Fig. 8) the performance of the parallel algorithm (modified in the two aforementioned
aspects) using an American put option with strike $K = 100$ and where the parameters
were $S_0 = 100$, $T = 3$, $\sigma = 0.3$ and $R = 0.06$. In the test the number of time steps
grew from 5000 to 40000, and the number $p$ of processors from 2 to 8. All the numeric
variables in the program were represented by 8-byte double-precision floats. The runtimes
and the speedups against an optimised sequential program are reported in Table III.
All the times were wall-clock times measured in milliseconds (ms). The computed
price for the American put option was 13.906.

The serial and parallel runtimes when $p = 8$ and the parallel speedups when $N = 40000$
are plotted in Fig. 11(a) and Fig. 11(b) respectively. The parallel efficiencies were
calculated from the speedups and plotted in Fig. 11(b) as well.

From the results we observed super-linear speedups in several test cases, e.g., when
$N = 30000$ and $p = 3$ the speedup was $S = 3.35$. This was caused partly by the caching
effect. The serial program can only use one of the four L2 caches (Fig. 8), but the
parallel program uses all four. Moreover, the parallel program makes use of both
FSBs, whereas the serial program uses only one. This also helps to increase
the rate at which data is transferred between the main memory and the processors.

In all the tests parameter $L$ (the maximum number of levels being processed in
a round) was set to 50, much increased from its value ($L = 5$) in the tests where
transaction costs are present. The purpose of increasing its value was to reduce the
number of times the threads have to be synchronised, and therefore reduce
the cost of the synchronisation. In the tests where transaction costs are considered,
because the computation time was long enough relative to the synchronisation time,
the performance was not as sensitive to the synchronisation overhead.
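
The effect of $L$ on the number of synchronisations can be estimated by counting rounds. The Python sketch below reuses the re-balancing rule in simplified form: it assumes that a base level $B$ has $B + 1$ nodes, that each thread receives the integer share of these nodes, and that one end-of-round synchronisation happens per round. The function name and the `leaf_level` parameter are hypothetical.

```python
def count_rounds(leaf_level, p, L):
    """Number of rounds needed to go from the leaf level down to the root."""
    B = leaf_level
    rounds = 0
    while B > 0:
        cols = B + 1                       # nodes at base level B
        while cols < 2 * p and p > 1:
            p -= 1                         # keep at least two nodes per thread
        D = min(L, cols // p - 1)          # levels processed in this round
        B -= D
        rounds += 1
    return rounds
```

Under these assumptions a tree with 40000 levels and 8 threads needs roughly ten times fewer rounds with $L = 50$ than with $L = 5$, which is the reduction in synchronisation that the larger value is meant to achieve.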
Fig. 11: Plots derived from the performance tests on the American put option without
transaction costs.